Estimating Photosynthetic Attributes from High-Throughput Canopy Hyperspectral Sensing in Sorghum

Sorghum, a genetically diverse C4 cereal, is an ideal model to study natural variation in photosynthetic capacity. Specific leaf nitrogen (SLN) and leaf mass per leaf area (LMA), as well as, maximal rates of Rubisco carboxylation (Vcmax), phosphoenolpyruvate (PEP) carboxylation (Vpmax), and electron transport (Jmax), quantified using a C4 photosynthesis model, were evaluated in two field-grown training sets (n = 169 plots including 124 genotypes) in 2019 and 2020. Partial least square regression (PLSR) was used to predict Vcmax (R2 = 0.83), Vpmax (R2 = 0.93), Jmax (R2 = 0.76), SLN (R2 = 0.82), and LMA (R2 = 0.68) from tractor-based hyperspectral sensing. Further assessments of the capability of the PLSR models for Vcmax, Vpmax, Jmax, SLN, and LMA were conducted by extrapolating these models to two trials of genome-wide association studies adjacent to the training sets in 2019 (n = 875 plots including 650 genotypes) and 2020 (n = 912 plots with 634 genotypes). The predicted traits showed medium to high heritability and genome-wide association studies using the predicted values identified four QTL for Vcmax and two QTL for Jmax. Candidate genes within 200 kb of the Vcmax QTL were involved in nitrogen storage, which is closely associated with Rubisco, while not directly associated with Rubisco activity per se. Jmax QTL was enriched for candidate genes involved in electron transport. These outcomes suggest the methods here are of great promise to effectively screen large germplasm collections for enhanced photosynthetic capacity.


Introduction
Sorghum (Sorghum bicolor L. Moench), a C 4 pathway species and the world's fifth most produced cereal [1], is adapted to a range of environments and retains high photosynthetic efficiency in diverse conditions [2][3][4]. These characteristics make it a crop of interest for the dual challenge of meeting increasing demands for food and adapting to the effects of climate change [5,6]. In addition to the C 4 pathway, which confers adaptation to hot and dry environments, the natural genetic diversity of sorghum provides potential to identify genotypes or genetic loci associated with greater photosynthetic capacity [7]. However, in order to select the photosynthetically favourable genotypes adapted to contrasting environments, tools are required to quantify the biochemical parameters underpinning photosynthetic capacity in a high-throughput manner, removing the phenotyping bottleneck with the traditional gas exchange approach.
Photosynthesis is the process of converting captured solar radiation into chemical energy by fixing carbon dioxide (CO 2 ) to form carbohydrates and biomass. Improving photosynthetic capacity is seen as a major target to further improve crop yields [2,3,8]. Screening germplasm to directly breed for improved photosynthetic responses to environment conditions is constrained by the complexity of measuring such responses and requires development of higher-throughput indirect phenotyping techniques.
In the C 4 photosynthetic pathway, the biochemical processes in the mesophyll cells are coordinated with a CO 2 concentrating mechanism in the bundle-sheath cells [9,10]. In the mesophyll, CO 2 is initially fixed by phosphoenolpyruvate (PEP) carboxylase into C 4 acids, which are then decarboxylated in the bundle sheath cells leading to high CO 2 levels and hence more efficient carboxylation of Ribulose-1,5-bisphosphate (RuBP) by Ribulose 1,5-bisphosphate carboxylase-oxygenase (Rubisco) [11,12]. The energy for the regeneration of RuBP in the bundle sheath and PEP in the mesophyll comes from chloroplast electron transport [11]. Due to their key roles in the photosynthetic pathway, the maximal rates of Rubisco carboxylation (V cmax , μmol m -2 s -1 ), PEP carboxylation (V pmax , μmol m -2 s -1 ), and maximal electron transport rate (J max , μmol m -2 s -1 ) largely determine photosynthetic capacity of C 4 plants and therefore underpin crop productivity. Simulations using a diurnal canopy photosynthesis model predict that canopy growth rate of C 4 cereals responds largely to changes in J max [13]. Quantification of these biochemical parameters is hence of value for selecting enhanced photosynthesis and growth. This is traditionally achieved by conducting gas exchange measurements and fitting observed photosynthetic responses to CO 2 or light with the Rubisco-activity or electron-transport limited equations in the C 4 photosynthesis model [11,14]. However, this method is very timeconsuming and not suitable for high-throughput screening of large germplasm collections.
The capacity of leaves to convert absorbed CO 2 and radiation into biomass also depends on key leaf physiological and structural properties [15]. Two such properties are specific leaf nitrogen (SLN, g m -2 ) and leaf mass per leaf area (LMA, g m -2 ), and both of these are known to be closely associated with photosynthetic capacity [16,17]. Because nitrogen is a key element in photosynthetic machinery, such as chloroplasts, plant nitrogen status closely links with leaf photosynthetic rates and canopy radiation use efficiency [18][19][20] and is hence an important parameter in canopy performance modelling [13,21]. The relationship between leaf nitrogen content and maximal net photosynthesis rate is influenced by LMA which is strongly associated with leaf lifespan and thus affecting the rates of the photosynthetic parameters [15,16,22]. However, conventional measurements of SLN and LMA are destructive and slow, limiting their potential to identify germplasm with higher photosynthetic capacity in large breeding programs.
High-throughput plant phenotyping technologies enable the collection of plant biochemical and physiological traits rapidly and nondestructively at large scale [23][24][25][26]. Various vegetation indices, which are usually calculated using a few selected wavelengths, have been correlated with plant structural traits (e.g., leaf area index and biomass) or leaf pigment concentration (e.g., chlorophyll). Typical canopy size indicators include normalized difference vegetation index (NDVI) [27,28] and optimized soil adjusted vegetation index (OSAVI) [29]. Chlorophyll content, on the other hand, has been indicated by indices, such as normalized difference red edge (NDRE) [30] and chlorophyll vegetation index (CVI), which is an indirect measure of nitrogen content [31]. Adjustments to these vegetation indices have also been reported. For example, replacing red bands with red edge when calculating some indices exhibited better performance in estimating chlorophyll content [32].
More recently, hyperspectral imaging sensors with wavelengths in the visible (400-700 nm), near infrared (700-1000 nm), and shortwave infrared (1000-2500 nm) domain have advanced the development of high-resolution spectroscopy techniques. This has led to significant increases in the accuracy and the types of physiological properties that can be retrieved [26,33]. The linkage between photosynthetic capacity and hyperspectral features therefore constitutes a promising avenue to predict photosynthetic performance of plants across broad scales [20,[34][35][36]. Various studies have exploited the plethora of bands (>270) and the much narrower band width (<6 nm) available from current hyperspectral sensors to better quantify biochemical and physiological properties in crops [35,37]. However, most of the studies so far use hyperspectral reflectance to estimate leaf photosynthetic capacity in C 3 crops [34,35,[37][38][39][40][41], and similar studies are much rarer for C 4 crops. At least one study focused on V cmax , V pmax , leaf nitrogen content, and specific leaf area from whole spectra reflectance (500-2400 nm) using partial least square regression (PLSR) in C 4 crop maize [42]. However, J max that quantifies the rate of electron-transport limited photosynthetic rate [11] is also important in determining daily biomass growth [13], but has not previously been targeted.
A more comprehensive study on quantifying the key parameters of photosynthesis V cmax , V pmax , and J max in a C 4 crop species is proposed. In addition, a highthroughput method to predict key parameters linked to photosynthetic capacity from canopy-level hyperspectral measurements will aid in the selection of genetic material with improved photosynthetic capacity at a large scale. To our knowledge, there are no published previous attempts to estimate the full set of key parameters known to limit C 4 photosynthesis, at canopy level, using hyperspectral reflectance. Additionally, next generation sequencing techniques have provided a high-throughput and cost-efficient tool for detecting genomic regions associated with crop traits of interest via genome-wide association studies (GWAS) [43][44][45]. Combining the techniques of hyperspectral sensing and GWAS would greatly facilitate the improvement of photosynthetic capacity and ultimate crop performance, which to date has rarely been explored.
The main objective of this study was to estimate traits associated with photosynthetic capacity from proximal hyperspectral sensing of sorghum canopies. Specifically, we aimed to (i) develop algorithms to predict photosynthetic parameters (V cmax , V pmax , and J max ), SLN, and LMA from proximal hyperspectral canopy reflectance captured with a spectrometer attached to a mobile phenotyping platform in two field-grown training sets; (ii) extrapolate the algorithms to GWAS trials grown adjacent to the training sets using a fully genotyped sorghum diversity panel; (iii) evaluate the heritability of the predicted traits; and (iv) undertake GWAS to detect genomic loci associated with the key 2 Plant Phenomics photosynthetic parameters and identify potential candidate genes to assess the usefulness and robustness of the approaches used in this study.

GWAS Trials.
Two field experiments were conducted during two consecutive summer seasons (2019 and 2020) at Gatton Research Station (GAT), Gatton, Queensland, Australia (27°33′ S, 152°20′E, 94 m above sea level). GAT1 and GAT2 were sown on 14 January 2019 and 12 November 2019, respectively. Both trials were designed using partial replication with spatially randomised genotypes arranged in rows and columns. There were 875 plots, including 650 genotypes in GAT1, and 912 plots, including 634 genotypes in GAT2, with 70 genotypes in common between trials ( Table 1). The genotypes in GAT1 were all inbred lines (n = 649) from a sorghum diversity panel comprising world-wide collections [43], and one hybrid was also included. In GAT2, 89% genotypes were hybrids from the Queensland breeding program, and the rest were inbred lines from the sorghum diversity panel. Each plot (4.5 m length and 3 m width) sown to a genotype consisted of four rows. Both trials were planted with a GPS precision planter at a population density of 108,000 plants ha -1 . For both trials, 150 kg of nitrogen per hectare was applied preplanting, and plots were irrigated regularly to provide nutrient and water nonlimiting conditions. The temperature, photosynthetic photon flux (PPF), and relative humidity (RH) from 6 am to 6 pm for the duration of each trial are shown in Table 1.

Training
Sets. Adjacent to each of the GWAS trials, a training set comprising a representative sample of the lines in the GWAS trials was used to collect ground truth data for association with hyperspectral measurements. Completely randomised block designs (row-column) were also used in the training sets. The middle two rows (0.63 m row spacing) of each four-row plot were used for the ground truth data collection while the outside two rows (0.75 m row spacing) were guard rows. The training set in 2019 (TS1) consisted of 80 plots comprising 60 genotypes which were all inbred lines and also included in GAT1. In the training set of 2020 (TS2), there were 108 plots with 93 genotypes of which 63 (68%) were hybrids. There were 19 genotypes in common between TS1 and TS2. Due to differences in germination and vigour of the diverse germplasm used, there was substantial variability in final plant establishment in both trials. The ground truth measurements were only taken from plots which had good establishment, which reduced the number of possible observations that could be used to develop the models. To maximise the number and the range of observations, the ground truth data from TS1 and TS2 were pooled.

Ground Truth Measurements in the Training Sets.
In both trials, gas exchange measurements were taken under mostly cloudless conditions (between 9 am and 12 pm) between 35 and 50 days after sowing (DAS)), which was during the active vegetative growth period for all genotypes and hence before the switch to reproductive growth which may introduce physiological and metabolic changes, but after full canopy closure. This period is known to be the most critical period for grain production in sorghum [46]. In total, 75 CO 2 (ACi) and 75 light (Ai) response curves were collected across TS1 (n = 31 plots comprising 29 inbred lines) and TS2 (n = 44 plots comprising 30 hybrid and 10 inbred lines) with six inbred lines in common between TS1 and TS2. One plant per plot was randomly selected for gas exchange measurements. The ACi curves were performed on the last or second last fully expanded leaf using a LI-6400 (LI-COR, Inc., Lincoln, Nebraska USA) with a 6400-02B Red/Blue LED light source illuminating a leaf chamber of 6 cm 2 . To measure ACi curves, photosynthetically active radiation (PAR) was set at 1800 μmol photons m -2 s -1 , flow rate through the chamber at 500 μmol mol -1 , and temperature was set to leaf temperature measured at the commencement of each curve. Vapour-pressure deficit (VPD) was generally held at around 3.0 kPa, by adjusting the scrubbing of the incoming air via the desiccant. For each ACi curve, the reference CO 2 levels were set to the sequences of 200, 100, 50, 250, 400, 650, 800, 1000, 1200, and 1400 ppm, with a duration of 1-5 min for each step. Measurements were made at each CO 2 supply point when gas exchange had equilibrated, at which point, the coefficient of variation for the CO 2 concentration differential between the sample and reference analysers was below 1%. The light levels for the Ai curves were set at 2000, 1500, 1000, 500, 250, 120, 60, 30, 15, and 0 μmol m -2 s -1 . The other controls were set as follows: reference CO 2 (constant at 400 μmol mol -1 ), flow (500 μmol mol -1 ), temperature was set to leaf temperatures, and humidity was controlled by scrubbing incoming air to maintain a VPD around 3.0 kPa. The duration for every light level was 1-3 min. Sample and reference analysers were matched before each data point was logged.
A small square section of the leaf (1.6 cm 2 ) was collected with a leaf punch from the same leaf section as was used for gas exchange measurements. The leaf sections were dried at 80°C and weighed to calculate LMA (g m -2 ). Percent nitrogen of each sample was determined with a continuous flow isotope ratio mass spectrometer (CF-IRMS), and SLN (g m -2 ) was calculated by multiplying percent nitrogen with LMA. Across the two training sets, 129 SLN and 169 LMA observations (plots) were obtained, involving 124 unique genotypes.
To generate a maximised dataset and enhance robustness of associating the ground truth data taken in a plot with hyperspectral measurements obtained from the same plot, individual plots, rather than genotypes, were considered as an observational unit.

Canopy Hyperspectral Measurements.
Hyperspectral data captured before anthesis and around the same time as the ground-truthing data (at 58 and 52 DAS in 2019 and 2020, respectively) was used to associate with the ground truth data. At this stage of sorghum crop growth, canopies are fully closed and nitrogen content of individual leaves is expected to be at a maximum as all mainstem leaves are fully expanded, but, prior to any translocation of nitrogen during senescence [47]. A tractor-based field phenotyping platform 3 Plant Phenomics (GECKO; developed at The University of Queensland) which enables simultaneous crop canopy proximal sensing was used [48]. The tractor moves at a constant 1.1 metres per second and is integrated with a GPS real-time kinematic system with 2 cm accuracy to locate sampling plots (individual size of 4:5 × 3 m). A microhyperspectral imager (Micro-Hyperspec VNIR model, Headwall Photonics, Fitchburg, MA, USA) mounted on this phenotyping platform (3 m above ground and~1.7 m above the canopy) was used to obtain the spectral response of each pixel (5 × 5 mm) at 272 spectral wavelengths between 395 and 997 nm (visible and near infrared). The resolution was approximately 2.2 nm with 6.0 nm Full Width Half Maxima. A radiometric calibration (dark signal calibration) of the hyperspectral camera was performed weekly. A spectral calibration using the nominal white and spectral diffusers with specific band sets focused on the highest possible spectral resolution was conducted every three months by comparing their respective responses in almost identical illumination conditions. An automated software data calibration pipeline was used to convert raw digital numbers to reflectance values at each pixel. Pixel reflectance was calculated by the ratio between pixel radiance from the microhyperspectral imager and the reference pixel radiance from an upward sensor measuring incoming radiance. To segment plants from soil and remove background noise from lower canopy levels, a threshold of NDVI > 0:5 was applied for each pixel based on the fractional vegetation cover [27,36,49], which could ensures only spectral information from green leaves is retained for the reflectance calculations and shadows and other background noise are excluded from the hyperspectral images. After masking by NDVI > 0:5, plant pixels within a plot were averaged to calculate reflectance of each plot. All hyperspectral data was collected from 9 am to 12 pm to minimise the effects of relative orientation of the sun, and no adjustments were made for the sensor or the distribution of leaf angles in the masking. As an example, images, radiance, and reflectance pre-and postmasking by NDVI > 0:5 for plot 361 in 2020 are shown in Figure 1.
A set of hyperspectral vegetation indices known to be associated with photosynthesis was computed from the plot reflectance involving 16 wavelengths as shown in Figure 1. The equations used to calculate the indices in this study were summarised in Table 2.
Note: Wavelengths with black bars show the wavelengths used for calculating the set of vegetation indices known to be associated with photosynthesis; wavelengths with red bars indicate the wavelengths involved in the stepwise linear regression (referring to 2.2).

2.5.
Determining V cmax , V pmax , and J max from ACi and Ai Curves. For quantifying the actual photosynthetic parameters, we applied the C 4 photosynthesis model to the measured ACi and Ai response curves [11,14]. The CO 2 assimilation rate (A) in the bundle sheath is given by the minimum of either Rubisco carboxylation limited (A c ) or electron transport limited (A j ) rates: where, where O s is the O 2 partial pressure in the bundle sheath, γ * is the half of the reciprocal of Rubisco specificity, K c and K o are the Michaelis-Menten constant of Rubisco for CO 2 and O 2 , respectively, and R d is the mitochondrial respiration rate in the light. All enzymatic constants and variables in the equations above were detailed in a previous study [8].
The C s (CO 2 partial pressure in the bundle sheath) is modelled by ambient CO 2 (C a ) entering the leaf via stomata and being diffused into the mesophyll, converted into C 4 acids then decarboxylated, and released as CO 2 in the bundle sheath. The supply of CO 2 to the mesophyll (C m ) depends Table 1: Top: mean and maximum daily temperatures, mean daily photosynthetic photon flux, and relative humidity during the two GWAS trials and two training sets in 2019 and 2020; bottom: number of plots and genotypes used in each experiment; and the genotypes in common between trials are in italic. Year

Plant Phenomics
on the intercellular CO 2 partial pressure (C i ), the mesophyll conductance (g m ), and the demand term, which is the CO 2 assimilation rate A: Here, the effects of the leaf boundary layer and stomatal conductance are incorporated into the C i /C a term. The supply of CO 2 to the bundle sheath (C s ) can be limited by enzymatic capacity or chemical energy from the photosynthetic electron transport chain. For the enzyme-limited case, C s is given by where, where g bs is the bundle sheath conductance to CO 2 , R m is the mitochondrial respiration in the mesophyll, and K p is the Michaelis-Menten constant for CO 2 associated with PEP carboxylation. Equations (5) and (6) assume carboxylation of CO 2 by PEP is rate limiting.
The electron transport rate limited CO 2 supply is given by the same equation structure as in (5), but with the "V p " term replaced: where, where x is a partitioning factor of electron transport rate between the C 4 and C 3 cycles (~0.4) and φ is the ATP requirement of the C 4 cycle (~2 ATP). I 2 is the photosynthetically useful light absorbed by PSII (PS abs × incident light) and θ is an empirical curvature factor assumed as 0.3 [11].
Equations (3), (4), (7), and (8) were rearranged and fitted to measured Ai curve to infer J max , θ, and PS abs , which were fed into ACi curve fitting using Equations (2), (4), (5), and (6). Overall, this allows prediction of the Rubisco (V cmax ), PEP (V pmax ), and electron transport (J max ) limited CO 2 assimilation. The fitting was performed using the numerical solver option in Excel which minimises the sum of square errors of A between observed and predicted. The Excel spreadsheet for calculation is shown in Table S1, which Chlorophyll fluorescence p685/p655 [52] r690_r600 Chlorophyll fluorescence p690/p600 [52] r740_r700 p740/p700 Stepwise regression consists of iteratively adding and removing predictors used in the predictive model, in order to find the subset of variables in the dataset resulting in the best performing model that lowers prediction error. It has been used to select spectral wavelengths highly related to leaf nitrogen, lignin, and cellulose concentrations in diverse species [57,58]. Stepwise multilinear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data [59]. Input variables (vegetation indices) are eliminated according to the Pearson correlation coefficient with dependent variables (leaf properties and photosynthetic parameters), which should indicate the most relevant indices to photosynthesis. However, stepwise multilinear regression often suffers from multicollinearity existing in the predictors [58,60]. In this study, before undertaking stepwise multilinear regression, principal component analysis (PCA) was conducted for the set of hyperspectral vegetation indices in Table 2 to reduce collinearities among them. This resulted in a subset of vegetation indices with reduced correlation between each other which were used in stepwise multilinear regression. The wavelengths used to calculate all the vegetation indices in Table 2 and involved in the subset of vegetation indices are indicated in Figure 1(d). Stepwise multilinear regression using the "MASS" package in R (v 4.0.3) [61] was then conducted to detect the best models for photosynthetic parameters (V cmax , V pmax , and J max ) and key leaf properties (SLN and LMA). The best models for each trait were selected, based on Akaike's Information Criteria (AIC) which is commonly used in model selection with lower values indicating a more parsimonious model than a model with a higher AIC [62]. Coefficient of determination (R 2 ) and root mean squared error (RMSE) were used for model assessment.
2.6.2. Approach 2: Partial Least Square Regression (PLSR) Derived from Spectral Reflectance. In this approach, PLSR was used to correlate the spectra reflectance of all available wavelengths with the photosynthetic parameters (V cmax , V pmax , and J max ) and key leaf properties (SLN and LMA) across TS1 and TS2. PLSR has been commonly used in remote sensing spectroscopy to predict plant biochemical and physiological parameters, being able to handle highly correlated predictors and the case of more predictors than observations [60,63,64]. The "pls" package in R (v 4.0.3) predicted the traits of interest from reflectance of all the 272 wavelengths, via decomposing the predictor matrix into a set of loadings and scores with the objective of maximising covariance between the scores and response [65,66]. This process is repeated for a given number of latent variables as the number of loadings and scores necessary to explain sufficient variance in response. The optimal number of latent variables was taken as the minimum number required to minimise the root mean squared error of prediction while not significantly decreasing the cross-validation error, with a maximum of 25 latent variables being considered. The evaluation of the PLSR models was performed by a leave-one-out cross-validation approach, by training the model on all but one observation and then predicting for the remaining observations [67]. The benefit of many iterations of fitting and evaluating during this cross-validation is that it results in a more robust estimate of model performance as each row of data is given an opportunity to represent the entirety of the test dataset, which is appropriate for a small dataset given the computational cost [68,69]. This cross-validation approach has been applied in remote sensing of wheat leaf area index, maize and tobacco biochemical traits, crop yield forecasting, and poplar tree photosynthetic capacity predicting from spectral measurements [40,42,[70][71][72][73]. The performances of these regression models were assessed using R 2 and RMSE.

Extrapolating the PLSR Models Built across the Training
Sets to the GWAS Trials. To further test the accuracy of the PLSR models built across the training sets, the PLSR models for V cmax , V pmax , J max , SLN, and LMA were used to estimate these traits for each line in the GWAS trials GAT1 and GAT2. Subsequently, GWAS analyses for the two most important photosynthetic parameters (V cmax and J max ) in GAT1 were conducted to identify the underlying genetic loci.

BLUPs for the Traits of Interest in the GWAS Trials.
To minimise environmental and special effects within trials and perform GWAS, the best linear unbiased predictors (BLUPs) of the predicted traits in the GWAS trials were calculated using a restricted maximum likelihood (REML) by fitting a linear mixed model using the ASReml-R package (Equation (9)) [74,75].
where the response vector y is modelled by all the fixed effects β, random effects u, and all the residual effects ε.
The matrix X represents the design matrix for the fixed effects, and the matrix Z is the design matrix for the random effects. The fixed effects were composed of main effects for each trial plus any effects associated with linear changes along the rows and columns. The random effects contained sources of error within each trial including replication and any trial specific random row and column effects. The residual effects included trial specific residual effects and first order autoregressive (AR1) effects in both the row and column directions for each trial. The model included genotype as a random effect to predict genotype BLUPs within trials. All possible sources of variation in the BLUPs were allowed for in the linear mixed model [75]. A generalised measure of heritability was calculated due to the complex variance structure, of which the equation is given by (Equation (10)).

Plant Phenomics
where H 2 is the generalised heritability, σ 2 g represents the genetic variance, and SED is the average standard error of difference [76].
2.7.2. GWAS for V cmax and J max in the GWAS Trial GAT1. All genotypes from the diversity panel used in the GWAS trial GAT1 were resequenced by Diversity Arrays Technology Pty Ltd (http://www.diversityarrays.com). The sequence data was aligned to version v3.1 of the sorghum reference genome sequence [77] to identify SNPs (Single Nucleotide Polymorphisms), resulting in 414,899 SNPs. GWAS analyses were conducted using BLUPs of V cmax and J max predicted by extrapolating the PLSR models from the training sets to the GWAS trial GAT1. Software FarmCPU [78] was used to conduct GWAS, using 302,631 filtered SNPs (minor allele frequency ðMAFÞ > 0:01). A significant threshold was set as Bonferroni-corrected 0.05/number of effective SNPs [79,80], resulting in a threshold of p value < 1.6e-7.

Pathway Enrichment Analyses Based on Genes within
200 kb from the QTL of V cmax and J max . To further evaluate the reliability of extrapolating the PLSR models for V cmax and J max from the training sets to the GWAS trials, pathways enriched for genes around the QTL of V cmax and J max were analysed using PhytoMine of Phytozome v13 (https:// phytozome-next.jgi.doe.gov/phytomine/begin.do), by inputting genes within 200 kb of each QTL detected from the Sor-ghum_bicolor.Sorghum_bicolor_NCBIv3.47.chr.gff3. Genes identified as enriched in the pathways via PhytoMine were defined as candidate genes. μmol m -2 s -1 , V pmax had an average of 410 μmol m -2 s -1 and ranged from 105 to 952 μmol m -2 s -1 , and J max ranged from 227 to 673 μmol m -2 s -1 with a mean of 383. No significant differences were observed in the photosynthetic parameters between the training sets in two years (ANOVA, p > 0:05), and pooled data of observations from individual plots across TS1 and TS2 were used to enrich the results. With the pooled data, a total of 75 ACi and 75 Ai curves were used for fitting V cmax ,V pmax , and J max . However, eight ACi curves could not be fitted sensibly with the C 4 photosynthesis model, possibly due to low data quality caused by high air temperature (> 38°C, Table 1). Given the possible errors from confounding environmental factors in the fittings of V cmax , V pmax , and J max , extreme values (V cmax > 65, V pmax > 750, and J max > 700 μmol m −2 s −1 ) were treated as outliers and excluded from further analyses as shown in Figures 2(a)-2(c), based on their average values. In total, 67 V cmax , 60 V pmax , and 74 J max plot observations were effective for further analyses.

Variation in Ground Truth
SLN varied from 1.6 to 2.4 g m -2 with a mean of 2.0 g m -2 in TS1 and ranged from 1.3 to 2.5 g m -2 with a mean of 1.9 g m -2 in TS2 (Figure 2(d)). Pooled data across the two training sets was used for the estimation of SLN (n = 129 plots) ( Table 1). LMA ranged from 36.0 to 63.5 g m -2 (n = 169 plots) and did not significantly differ between TS1 and TS2 (Figure 2(e)), and data from the two trials were pooled together. No outliers of SLN or LMA were removed from the following analyses, given no extreme values were observed (Figures 2(d) and 2(e)). Thus, in total, 129 SLN and 169 LMA observations were used for association with hyperspectral data. The best models based on the AIC criteria are given in Table 3. All models were significant (p < 0:05) for estimating the photosynthetic parameters, despite the low R 2 of around 0.20 (Table 3). The RMSEs for predicting V cmax , V pmax , and J max were 9%, 35%, and 18% of the mean, respectively, suggesting a modest accuracy in estimations of the photosynthetic parameters from the proximal hyperspectral vegetation indices. Moreover, the vegetation indices detected in the best models for V cmax , V pmax , and J max were mostly based on near infrared (~800 nm), red edge (~710-750 nm), and green (~550 nm) portions of the spectrum (Figure 1(d)), such as CVI, curvature, and OSAVI, which have previously mostly been used as indicators for variation in nitrogen status and canopy size [28][29][30][31]52]. Interestingly, significant association of V pmax and J max with an oxygen-A band based index (r760) was observed, which has been used to predict chlorophyll fluorescence [81], suggesting sensitivity of this region to photosynthesis. An indicator of light use efficiency, PRI (based on 531 and 570 nm), showed a high coefficient in the estimators of V cmax , consistent with the physiological linkages between maximum Rubisco activity and electron transport processes. Red_edge and curvature, known to be sensitive to chlorophyll content [52], were commonly detected in the best stepwise multilinear regression models for SLN and LMA.  Figures 4(a)-4(c)). The RMSEs for estimating V cmax , V pmax , and J max were reduced to 4%, 12%, and 10% of the mean, respectively (Figures 4(a)-4(c)). Model loadings, (Figures 4(d)-4(f)) which indicate the contribution of the wavelengths in a specific PLSR model, highlighted the red edge (685-750 nm) and near infrared (a major peak around 950-960 nm) region as important regions for predicting photosynthetic capacity. Using PLSR derived from reflectance of all wavelengths, the predictions of SLN and LMA improved in both R 2 and RMSE compared with the models developed by stepwise multilinear regression using vegetation indices (Figures 5(a) and 5(b)). For SLN and LMA, the RMSE was reduced to 5% and 6% of the mean, respectively. The R 2 reached 0.82 for SLN and 0.68 for LMA. In the models for SLN and LMA, the wavelengths with high loadings largely fell in the near infrared regions with peaks around 722-769 nm and 922-956 nm (Figures 5(c) and 5(d)).

Extrapolating the PLSR Models Built Using the Training
Sets to the GWAS Trials 3.4.1. Variation and Heritability of V cmax , V pmax , and J max , and SLN and LMA in GAT1 and GAT2. When using the PLSR models built across the two training sets to estimate the traits in the GWAS trials, reasonable ranges and heritability were observed for all the traits, especially for the two key photosynthetic parameters V cmax and J max ( Table 4). The ranges of the predicted V cmax (46-65 μmol m -2 s -1 ) and J max (317-595 μmol m -2 s -1 ) in GAT1 were particularly comparable with the ground truth measurements in the training sets (Figure 2), suggesting a reasonable accuracy of the extrapolations. This was also supported by the high heritability (around 0.90) of V cmax and J max in GAT1 ( Table 4). The heritabilities of the predictions in GAT2 were lower than in GAT1, because most of the genotypes in GAT2   Table 1 across two training sets. Note: Length of each arrow represents loading of each variable on dimension 1 and 2; contrib: contribution of each variable to dimension 1 and 2. 9 Plant Phenomics were hybrids which have less genetic diversity (Tables 1  and 4).

GWAS Based on the Predictions of V cmax and J max in GAT1.
To further evaluate the predictivity of the PLSR models, GWAS analyses were performed on BLUPs of V cmax and J max predictions in GAT1 (n = 649 inbred lines), and given V cmax and J max have been identified to be the two key photosynthetic parameters for determining net rate of canopy photosynthesis [13]. Four QTL were detected to be associated with the variation in V cmax (Figure 6 and Table 5), were located on chromosome 6, 9, and 10, suggesting likely genomic regions associated with the processes of CO 2 assimilation. In terms of J max , two QTL located on chromosomes 4 and 5 were identified, providing likely chromosomal regions relevant to the processes of electron transport.  To further assess the accuracy of the PLSR models from the training sets, the genes within 200 kb [43] from the QTL detected for V cmax and J max were analysed by PhytoMine (https://phytozome-next.jgi.doe.gov/ ). One pathway was enriched for five candidate genes of V cmax , which has been annotated to be associated with UDPG-glucosyl transferase ( Table 6). Another pathway, enriched for four candidate genes of J max , was found to be involved in metabolic processes resulting in the removal or addition of electrons (iron ion binding).

Discussion
In this study, five key photosynthesis related variables were investigated and predicted from canopy hyperspectral reflectance data, providing an efficient and nondestructive tool to screen genotypes for improved photosynthetic capacity at large scale. Maximal Rubisco carboxylation rate (V cmax ), PEP carboxylation rate (V pmax ), and electron transport rate (J max ), which are the main rate-limiting processes in C 4 -carbon assimilation, were quantified in a diverse set of sorghum genotypes across the two training sets (n = 75 plots including 63 genotypes). To date, this is the first attempt to correlate hyperspectral reflectance to detailed fittings of these three parameters from both ACi and Ai curves in C 4 pathway photosynthesis. The obtained V cmax and V pmax values were comparable with those reported previously in sorghum [82]. Compared with stepwise multilinear regression, PLSR models improved the prediction accuracy for the three photosynthetic parameters and the other two key leaf properties (SLN and LMA, n = 169 plots including 124 genotypes), based on R 2 (~0.80) and RMSE (less than 12% of mean). Subsequently, these PLSR models were extrapolated to two GWAS trials (875 plots with 650 genotypes in GAT1; 912 plots with 634 genotypes in GAT2), with the resulting predictions for both photosynthetic parameters and key leaf properties (SLN and LMA) showing medium to high      [42,69,70,[83][84][85]. However, measurements requiring leaf clips are not practical for screening thousands of breeding lines. To fully achieve high-throughput phenotyping, rather than using handheld spectroradiometers on a leaf-by-leaf basis, estimations of photosynthetic capacity from automated proximal or remote sensing at the canopy level are needed. Apart from greater throughput, canopy measurements also better reflect the whole-plant, which integrates photosynthetic activities measured at the leaf level.
Canopy hyperspectral reflectance has shown promise for estimating V cmax and net canopy photosynthetic rate through different approaches, such as airborne-based model inversion in wheat [34]. Another study using canopy hyperspectral reflectance also successfully predicted V cmax and J max with a ground-based phenotyping platform in tobacco [86]. Moreover, these authors compared three different PLSR approaches including reflectance-based, index-based, and model inversion-based methods, indicating better performance in models based on reflectance and indices than model inversion [86]. A comparison based on leaf-and plot-level PLSR models confirmed the capability of plotlevel hyperspectral imaging to predict photosynthetic parameters in transgenic tobacco plants expressing C 4 photosynthesis pathway genes [40]. In the present study, across 63 sorghum varieties in the training sets, V cmax , V pmax , and J max were predicted with reasonably high accuracy (R 2 around 0.80 and RMSE within 12% of mean) using PLSR models built from canopy hyperspectral data collected via a proximal phenotyping platform (~1.7 m from canopy). The index-based stepwise multilinear regression models for V cmax and J max could also estimate the photosynthetic parameters with a reasonably small RMSE around 13% of mean, although with much less percentage of variance explained (R 2 around 0.20). The results from the present study demonstrate the promise of utilising hyperspectral sensing at a canopy level in selective breeding for photosynthetic capacity at large scale and put forward a highthroughput tool to explore genotype by environment interactions of photosynthetic capacity related traits.

Models of SLN and LMA.
Nitrogen content has been one of the most successfully predicted traits in crops from both leaf and canopy spectral measurements [20,87,88]. In addition to the biochemical parameters, and given the strong associations of nitrogen and LMA with photosynthesis, remote sensing of the key leaf properties has also previously been explored, [16,89,90]. Among the estimations from PLSR models in this study, a high coefficient of determination was consistently observed in SLN predictions (R 2 = 0:82), which also had a low RMSE in stepwise multilinear models (10% of mean SLN), demonstrating the effectiveness and suitability of approaches applied in this study.
Another key leaf property, LMA, has been identified as a proxy of photosynthetic capacity in maize [89]. Robust models for predicting LMA from leaf-level hyperspectral reflectance have been reported for wheat and soybean [18,35,91]. Additionally, lower RMSE at canopy level than leaf level has been reported for LMA estimations, as multiple scattering in the upper canopy leaf layers could strengthen the expression of key leaf properties in a closed canopy compared with leaf-level measurements [83]. A more recent study in the C 3 crop zucchini using both leaf-and canopylevel hyperspectral reflectance and PLSR has successfully predicted LMA with R 2 of 0.91 and 0.60, respectively [92]. In the present study, low RMSE (6% of mean LMA) and medium to high R 2 of 0.68 were found in the LMA estimations from canopy hyperspectral reflectance using PLSR. This was also supported by LMA predictions from the stepwise multilinear regression with an acceptable RMSE, 10% of mean. These results indicate that proximally sensed and canopy-based hyperspectral reflectance measurements provide a rapid and robust measure of key leaf properties related to photosynthetic efficiency.

Potential Strategies to Train Robust Models for
Predicting Leaf Traits from Canopy-Based Sensing. When using canopy-level hyperspectral data to train leaf-level measurements, shadows, soil background, and canopy structure could be complicating factors that affect the robustness of the model. To address some of the issues with using canopy reflectance, a NDVI > 0:5 mask was applied to each pixel used in the reflectance calculation. This masked out the soil background reflectance and thus minimising the variation in spectral responses from effects associated with canopy heterogeneity (e.g., light or temperature) at the plot level. In addition, some of the noise from canopy structural factors was also minimised in this study by operating within one critical growth stage. However, for future application, developing models suitable for different stages or less sensitive to the variation of canopy structure within a time window would improve utility of the method developed here. Additionally, an automatic thresholding technique (e.g., Otsu) fused with canopy height from LiDAR could be applied in canopy delineation which should be more accurate [93] in delineating the exact canopy areas within a plot. This could reduce spurious reflectance values and thus increase the signal measured from proximal sensing at the canopy level, depending on agricultural contexts (e.g., species or canopy size). Alternatively, combining relevant models that improve the relationships between canopy hyperspectral reflectance and leaf photosynthetic parameters could be useful [34]. Increasing the number of ground truth samples can also improve model performance; however, simply increasing the size of the dataset not only leads to highly complex models but is also affected by the high costs associated with additional measurements, especially in the case of gas exchange 13 Plant Phenomics measurements which are notoriously slow to obtain [25,94]. To date, gas exchange measurements are the only realistic measurement of photosynthesis; however, given the confounding factor of variation in photosynthetic capacity within crop canopies of the same genotype [40], this is not ideal.
Reducing the confounding environmental factors (e.g., light or temperature) will also improve model strength when using canopy-based hyperspectral sensing methods to estimate key leaf traits. In this study, all ground truth and sensing data was collected between 9 am and 12 pm, which minimised the effects of sun angle, temperature, and light on canopy reflectance and on photosynthetic rates. Further improvement could be made by incorporating temperature at the time of image capture and tentatively correcting photosynthetic parameters to a standard temperature, as it is one of the most important environmental factors influencing both hyperspectral reflectance and photosynthesis. This was not considered here due to scarce documentation of temperature responses of Vcmax, Vpmax and Jmax in C 4 crops [39].

PLSR Derived from Entire Wavelength Spectrum
Strengthens Model Performance. Compared with the models developed using stepwise multilinear regression, PLSR models were more robust and demonstrated a higher cross validated R 2 and lower RMSE. This is attributed to the fact that additional spectral information was incorporated in the PLSR models using the complete wavelength range compared with the stepwise multilinear regression models [36,63,83,86]. Based on peak loadings (red edge and near infrared), the wavelengths that explained most of the variance in the PLSR models aligned closely with the locations of the wavelength bands selected to develop the best-performing multilinear vegetation index approach. Compared with the published indices that correlate with nitrogen content, a strong overlap was found around the red edge (~710-750 nm) in the present study, consistent with the finding that leaf nitrogen content is linearly correlated with the first derivatives of reflectance at the red edge region around 730 nm [20]. The most important parts of the spectrum for predicting photosynthesis have been shown to be in the visible (400-700 nm) and red edge (710-750 nm) range [83]. In this study, the spectral loadings used to predict photosynthetic parameters had similar peaks to the spectral loadings of SLN and LMA, likely attributed to these features being interdependent [89]. These results provide useful information for selecting relevant wavelengths to predict the traits of interest in further studies.

PLSR Models Built across the Training Sets Can Be
Extrapolated to the GWAS Trials. In this study, the PLSR models were extrapolated to the GWAS trials, demonstrating comparable variation and high heritability (~0.80) for the predicted biochemical (V cmax , V pmax , and J max ) and key leaf properties (SLN and LMA) in the GWAS trial (GAT1), including predominantly inbred lines. Based on the predictions for these traits in the GWAS trial (GAT2), comprising mostly hybrid lines, relatively lower heritability (~0.5) was observed, as expected, due to similarity among the hybrids both at the molecular and phenotypic level. This suggests hyperspectral sensing is a promising avenue to screen large populations for such traits that have previously been out of reach of crop breeding programs [34,42,95]. However, the capacity of green leaves to convert CO 2 into biomass varies throughout the season mainly due to interactions among genotypes, plant phenological stage and environment [13,96]. This is likely to further influence predictive skill especially in cases where there is a high within-population variation as a result of the genotype by environment interactions.
The models built across the training sets show sufficient skill to estimate key determinants of photosynthesis in large sorghum mapping populations, grown adjacent to these ground-truth trials, despite potential challenges of predicting leaf photosynthetic capacity from canopy-based hyperspectral sensing. This would not only enable the screening for materials with improved photosynthetic capacity, following identification of genetic loci and potential candidate genes for photosynthetic capacity in the C 4 crop sorghum but also benefit the quantification of the association between photosynthetic capacity and ultimate biomass improvement in crops. In further applications, it is important to select the best phenology stage for data collection, when the degree of canopy development expressed by leaf area index has more consistent levels of pigment concentration per unit area and more similar spectral response for reducing the impact of such confounding effects associated with plant growth processes (e.g., canopy structure and nitrogen status), [31]. Additionally, further studies to test temporal stability of relationships between canopy reflectance spectra and leaf photosynthetic capacity are needed before extrapolated associations from a specific hyperspectral measurement through the growing season can be made in other crops or agricultural contexts.
Here, GWAS analyses for the two photosynthetic parameters, V cmax and J max , provided useful information for further fine mapping to identify potential candidate genes controlling CO 2 assimilation and electron transport in sorghum. This is one of the significant and novel outcomes from this study, as this is the first attempt to quantify the genetic basis of the key photosynthetic parameters using hyperspectral sensing in hundreds of lines. Additionally, pathway enrichment analysis for genes within 200 kb from J max QTL detected four candidate genes involved in the process of electron transport and light signalling [97]. This means the PLSR model for J max built across the training sets was able to capture the genomic loci associated with its phenotypic variation in the sorghum diversity panel. The pathway enriched for genes within 200 kb from the V cmax QTL is known to catalyse the transfer of a hexosyl group from one compound to another, as well as function in nitrogen storage [98]. While this is not directly associated with Rubisco activity per se, plant nitrogen status is closely associated with Rubisco and leaf photosynthetic rates [18][19][20]. Additionally, the photosynthetic capacity is colimited by Rubisco activity (V cmax ) and RuBP regeneration, which depends on electron transport (J max ) and the coordination of Calvin cycle enzymes [11,93]. Enzyme interactions in the Calvin cycle are highly complex [92], and further studies 14 Plant Phenomics are needed to explore the relevance of the V cmax QTL detected here.

Conclusions
Being able to map crop traits associated with improved resource use efficiency (e.g., nitrogen, light, and water) will contribute to further understanding of the natural variation in photosynthetic processes and enable the exploration of opportunities to modify photosynthesis. This study developed a model using PLSR to estimate maximal Rubisco activities (V cmax , R 2 = 0:83), maximal PEP activities (V pmax , R 2 = 0:93), maximal electron transport activities (J max , R 2 = 0:76), specific leaf nitrogen (SLN, R 2 = 0:82), and leaf mass per leaf area (LMA, R 2 = 0:68) from proximal hyperspectral sensing using two combined training sets (n = 169 plots). Further, extrapolating the PLSR models built across the training sets to the GWAS trials including hundreds of lines demonstrates that the predictions of the traits of interest are heritable. GWAS analyses for J max in the inbred lines detected genomic regions comprising candidate genes controlling the process of electron transport. While the V cmax candidate genes identified here are not associated directly with Rubisco activity per se, they are involved in nitrogen storage which is closely associated with Rubisco. These results suggest that the PLSR models from the training sets were able to capture the phenotypic variation in the photosynthetic parameters allowing the discovery of the underlying genetic basis of these important traits.

Data Availability
All phenotypic data used to develop the models presented in this manuscript is available here: https://doi.org/10.48610/ acbe0df. Genotypic marker data used for GWAS is available upon request to the corresponding author.

Conflicts of Interest
The author(s) declare(s) that there is no conflict of interest regarding the publication of this article.

Acknowledgments
We would like to thank Glen Roulston, Kate Jordan, Janet Roberts, Jane Heron, and James Heron for assistance with data collection and the farm staff at Gatton Research Facilities and staff from the Queensland prebreeding for experiment management and Prof Susanne von Caemmerer for advice with collecting and interpreting gas exchange data.
X. Z. was financially supported through a University of Queensland Research Training Scholarship. This study was partially funded by the Centre of Excellence for Translational Photosynthesis, Australian Research Council (grant CE140100015) and the Bill & Melinda Gates Foundation (grant OPPGD1197 iMashilla "A targeted approach to sorghum improvement in moisture stress areas of Ethiopia").